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Abstract. There have been controversies among statisticians on (i) 
what to model and (ii) how to make inferences from models with unob- 
servables. One such controversy concerns the difference between esti- 
mation methods for the marginal means not necessarily having a proba- 
bilistic basis and statistical models having unobservables with a proba- 
bilistic basis. Another concerns likelihood-based inference for statistical 
models with unobservables. This needs an extended-likelihood frame- 
work, and we show how one such extension, hierarchical likelihood, 
allows this to be done. Modeling of unobservables leads to rich classes 
of new probabilistic models from which likelihood-type inferences can 
be made naturally with hierarchical likelihood. 

Key words and phrases: Hierarchical generalized linear model, unob- 
servables, random effects, likelihood, extended likelihood, hierarchical 
likelihood. 



1. INTRODUCTION 

Fisher introduced the concept of likelihood in 1921 
for inferences from statistical models involving two 
kinds of objects, namely observed random variables 
(data) and unknown fixed parameters. Pearson (1920) 
points out a limitation of Fisher likelihood for the 
prediction of unobserved future observations. Fisher's 
likelihood cannot be used to make inferences about 
unobservables. There has been an effort to extend 
likelihood inferences to models with unobservables 
by eliminating them via integration. However, with 
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a few exceptions such as the copula (Joe, 1997), 
marginal distributions for counts and proportions 
are not available in explicit forms, and this restricts 
the scope of the classical likelihood approach. 

In longitudinal studies, generalized estimating equa- 
tions (GEEs) are widely used. They give an estima- 
tion method for regression coefficients constructed 
directly to describe marginal means with the covari- 
ance structure regarded as contributing nuisance pa- 
rameters only. However, GEEs cannot (generally) be 
integrated to obtain a likelihood function (McCul- 
lagh and Nelder, 1989) and therefore may not have 
a probabilistic or likelihood basis. These estimation 
methods for marginal (or population-average) means 
are often contrasted with conditional (or subject- 
specific) models which include the modeling of un- 
observables. Jansen et al. (2006) review the use of 
GEE methods and conditional models for analysis of 
missing data and discuss the choice between them. 
However, we believe that such a choice is inappro- 
priate because the choice of an estimation method 
for a particular parameterization (marginal param- 
eter) should not pre-empt the process of model se- 
lection. Recently, Lee and Nelder (2004) have shown 
that alleged differences in the behavior of parame- 
ters between GEE methods and conditional mod- 
els are based on a failure to compare like with like. 
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We dislike the use of estimation methods without a 
probabilistic basis because, for example, inferences 
for joint and conditional probabilities are not possi- 
ble. 

Recently, broad classes of new probabilistic mod- 
els with unobserved random variables (unobserv- 
ables) have been proposed, such as generalized lin- 
ear models (GLMs) with random effects (Lee and 
Nelder, 1996), latent processes (Skrondal and Rabe- 
Hesketh, 2007), models for missing data (Little and 
Rubin, 2002), prediction (Bj0rnstad, 1990) and for 
potential outcomes in causality (Rubin, 2006), etc. 
In the statistical literature unobservables appear with 
various names such as random effects, latent pro- 
cesses, factor, missing data, unobserved future ob- 
servations, potential outcomes, and so on. Random 
effects in the mean model have been proposed to ac- 
count for within-subject correlation in longitudinal 
studies (Diggle, Liang and Zeger, 1996) (for smooth 
spatial data, see Besag and Higdon, 1999; for spline- 
type function fitting, see Eilers and Marx, 1996; 
and for factor analysis, see Bartholomew, 1987, etc.) 
while random effects in the dispersion model (Lee 
and Nelder, 2006a) can account for heteroskedas- 
ticity, giving heavy-tailed distributions that allow 
robust modeling (Noh and Lee, 2007a). 

Modeling of unobservables is the key to these new 
models. However, because of difficulties in making 
likelihood inferences about unobservables, some au- 
thors use the Fisher likelihood for inferences about 
fixed unknown parameters while for inferences about 
unobservables they use the empirical Bayesian (EB) 
approach or the full Bayesian (FB) inference. Re- 
cently, Zhao et al. (2006) have used an FB approach 
with which they claim to have an advantage over the 
frequentist version (EB) in that it is computationally 
simpler to obtain variance estimates of the random- 
effect estimates. (Note that the word "prediction" 
has often been used to denote the estimation of ran- 
dom effects. However, we believe that it is clearer 
to use prediction when we estimate future observa- 
tions (unobservables) and estimation for the estima- 
tion of random effects in the data already observed.) 
Discussing the controversy between Fisher and Ney- 
man, Rubin (2005) maintains that models with un- 
observables arose most naturally in causal inference 
within an FB framework. From Lindley and Smith 
(1972) onwards, FB has become dominant for the 
analysis of models with unobservables. The avail- 
ability of Markov-Chain Monte Carlo, which imple- 
ments FB procedures, has made FB inferences pop- 
ular. 



By contrast we believe that modeling of unobserv- 
ables is natural within an extended likelihood frame- 
work. Recently, for general inferences from models 
involving unobservables, Lee and Nelder (1996) pro- 
pose the use of the hierarchical (or h-)likelihood. 
The h-likelihood plays a key role in the synthesis of 
the likelihood inferential tools needed for a broad 
class of new models having unobservables. The h- 
likelihood approach takes into account the uncer- 
tainty in the estimation of random effects, so that 
inferences about unobservables are possible without 
resorting to an EB framework. 

In the next section we review some models with 
unobservables and discuss related modeling issues. 
We review the h-likelihood procedure for the estima- 
tion of random effects and compare them with the 
Bayesian approach in Section 3; likelihood inferences 
from such models are demonstrated with examples 
in Section 4, followed by conclusions in Section 5. 

2. HOW TO MODEL UNOBSERVABLES 

Multivariate distributions for non-Gaussian mod- 
els can be produced by probabilistic modeling of un- 
observables without requiring explicit multivariate 
generalizations of non-Gaussian distributions. Us- 
ing hierarchical likelihood, inferences from these new 
classes can be made. 

2.1 HGLMs: Random Effects in the Mean 

HGLMs allow a synthesis of GLMs, random-effect 
models and structured-dispersion models. Consider 
a GLM with random effects where the response y 
follows the GLM, conditioning on random effects v, 

(2.1) [i = E(y\v) and v&r(y\v) = 4>V(fj.) 
with a linear predictor, 

(2.2) r] = Xf3 + Zv, 

where rj = g(/j,) for some monotonic function g(-). 
When v are normal the models are called generalized 
linear mixed models (GLMMs). The use of other dis- 
tributions for the random effects enriches the class 
of models. Lee and Nelder (1996) introduce HGLMs 
in which the distribution of the random components 
is extended to an arbitrary conjugate distribution of 
a GLM family with an appropriate link, not neces- 
sarily that of the conjugate pair. Above we suppress 
the indices to mean that our discussion covers vari- 
ous models having single or multiple random effects 
with nested, crossed, combined structures, etc. We 
write indices if necessary. 
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To allow various patterned associations among ran- 
dom effects Lee and Nelder (2001b) propose adding 
an additional feature to HGLMs as follows: Let v = 
Lr with r being random effects with a diagonal co- 
variance matrix var(r) = A to give 



var (V) 



S = LAL . 



The last equation can be a spectral decomposition 
with an orthogonal matrix L or a Choleski decom- 
position with upper or lower triangular matrix L. 
Zhao et al. (2006) note that the full generality of 
the GLMM requires using general design matrices 
for both fixed and random components. With fixed 
L, not depending upon unknown parameters, we 
have models for longitudinal studies, intrinsic au- 
toregressive models, various spline models, etc. With 
parameter-dependent L we have random-slope mod- 
els, autoregressive models, antedependence models, 
Markov-random-field models, and so on (Lee and 
Nelder, 2001b). These models are also able to han- 
dle a great range of complications in regression- type 
analysis, for instance, within-subject correlation in 
longitudinal data, scatterplot smoothing, general- 
ized additive models, Kriging, function estimation 
and non-parametric regression models such as gener- 
alized additive models and varying-coefficient mod- 
els (Zhao et al., 2006). 

Example 1: Consider the model from item-response 
theory (IRT) such that 



ex.p(vjj - /3j) 
1 + exp(vij - (3j 



where fij is the intrinsic difficulty of the jth item, 
and Vij is the ith subject's ability for the jth item. If 
Vij = Vi with Vi ~ iV(0, A) it becomes a one-parameter 
IRT model (Rasch, 1960). An appealing feature of 
this model is that items and subjects (examinees) 
can be placed on a common scale. Differences in 
both difficulty between items and ability of subjects 
is assumed to remain the same. In this model, for 
a given item, the probability of a correct response 
increases monotonically with ability as in Figure 1. 

If = TiOLj with rj ~ A^(0, A) and ctj fixed un- 
known, we have a two-parameter IRT model. Let 
Vi = (va, . . .,VikY and Li = (o-i, . ..,a k )\ giving 

t Art 



var(uj) 



where A = A is a one-by-one matrix. This model al- 
lows for correlations among items for each subject. 
In this model otj is called the discriminant param- 
eter and Pj = (3j/ctj the difficulty parameter (Skro- 
ndal and Rabe-Hesketh, 2007). This two-parameter 
IRT model may lack the monotonicity property in 
that one item can be easier than another for one 
subject, while being more difficult for another; this 
is described by the item-subject interaction riOj. 
This example shows how a particular modeling of 
the (singular) covariance matrix Sj can give an in- 
teresting interpretation of the parameters. 

Example 2: When vt = pvt-i + rt with var(r^) = A 
we have autoregressive random effects of order 1. 
When p = 1 we have the random-walk model which 
gives a singular precision matrix. This random-walk 
model for temporal correlation has been extended 




Fig. 1. Curves ofPx{yij = l\vi) with respect to v (left) in a one-parameter IRT model and r (right) in a two-parameter IRT 
model. 
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to spatially-correlated models via intrinsic autore- 
gressive models with a singular fixed-precision ma- 
trix (Besag and Kooperberg, 1995). Splines can be 
viewed as smoothing via random effects which also 
have a singular fixed-precision matrix (Green and 
Silverman, 1994). 

Example 3: Skrondal and Rabe-Hesketh (2004) 
propose generalized linear latent and mixed mod- 
els (GLLAMMs) as a means of unifying factor mod- 
els, linear structural-relations models and covariate 
measurement-error models. They point out that 
GLLAMMs consist of two building blocks, a response 
model and a structural model. For the response 
model, they use the HGLM shown in equation (2.2). 
For the structural model, the random effect itself 
satisfies a regression model of the form 

v = Bv + Tw + r, 

where B is a matrix of structural parameters re- 
lating the latent dependent variables to each other, 
r is a matrix of structural parameters relating the 
latent dependent variable to the latent explanatory 
variables and r is a vector of disturbances. From this 
we have 

v = (I- B^Tw + (I — BY 1 r. 

Thus, the GLLAMMs can be represented as an 
HGLM with two random components, 

77 = g (fj,) =X(3 + Zv = Xf3 + ZL lW + ZL 2 r, 

where L x = (I - S) _1 r and L 2 = (I - B)' 1 . In 
GLLAMMs the parametrization using B, T, var(w) 
and var(r) gives a useful interpretation. 

Another class of widely used models with unob- 
servables is nonlinear mixed-effect models in pop- 
ulation pharmacokinetics and pharmacodynamics, 
models for missing data and models for potential 
outcomes. 

2.2 Random-Effect Models for the Dispersion 

Lee and Nelder (2006a) introduce double HGLMs 
(DHGLMs) which allow random effects for the dis- 
persion. This gives a systematic way of generating 
heavy-tailed distributions for various types of data 
such as counts, proportions, and so on. Random ef- 
fects in the mean affect the first two cumulants of the 
distribution of responses while those in the disper- 
sion affect the third and fourth cumulants, so that 
by allowing random effects in both mean and disper- 
sion we can generate models with various patterns 
in the first four cumulants. Castillo and Lee (2008) 



show that DHGLMs provide a general treatment 
of Levy-process models in financial modeling while 
Noh and Lee (2007a) show that this new class allows 
robust modeling of GLM classes with bounded influ- 
ence. Yun and Lee (2006) show how to model abrupt 
changes in the behavior of schizophrenics. Glidden 
and Liang (2002) show that sensitivity of estimators 
for j3 from HGLMs become more serious when the 
data form a selected sample. However, Noh et al. 
(2005) show that by using a heavy-tailed distribu- 
tion for the random effects, such a sensitivity in the 
estimators can be avoided. 

2.3 Probabilistic and Nonprobabilistic Methods 

Without introducing random effects the GEE can 
be used to obtain maximum likelihood (ML) estima- 
tors when responses are normal. Estimates of regres- 
sion coefficients from GEEs have been claimed to be 
consistent under various model misspecifications. It 
is often called the population-averaged model (Zeger 
et al., 1988) or the marginal model (Jansen et al., 
2006) for a particular parameterization [regression 
coefficients for marginal means E(y)]. For correlated 
non-normal responses, given a GEE U (/3 S ) = dq/d(3 s = 
(let us say) , the mixed derivatives may not be the 
same (McCullagh and Nelder, 1989, page 337), that 
is, 

d 2 q/d/3 s dPr = dU{p s )/dp r ^ dU(P r )/d/3 s 

= d 2 q/dp r dP s ; 

if so there is no probabilistic model leading to the 
GEE U(/3 S ) = 0. Without such a basis the claim of 
consistency is meaningless (for more discussion see 
Crowder, 1995 and Chaganty and Joe, 2006). 

It is of interest to study the class of marginal mod- 
els, allowing estimating equations. Various marginal 
models have been proposed by Molenberghs and 
Lesaffre (1994), Molenberghs et al. (2007) and Hea- 
gerty and Zeger (2000). Heagerty and Zeger (2000) 
claimed that the parameter estimates from their 
marginal models were less sensitive to the misspec- 
ification of the distribution of random effects. Lee 
and Nelder (2004) show that if one compares like 
with like the differences between the results from 
the two models are not great. All that we can say 
is that certain parameterizations are less sensitive 
under certain probabilistic models so that it could 
be recommended to use such a parameterization if it 
also met scientific requirements. For further contro- 
versies on parameterizations see Lindsey and Lam- 
bert (1998). 
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GEE is an estimating method, not a model. Thus 
we do not believe that a useful comparison can be 
made between a probabilistic model such as a HGLM 
and an estimating method such as GEE. We see the 
analysis of data as consisting of three main activi- 
ties: the first two are model fitting and model check- 
ing which aim to find parsimonious well-fitting mod- 
els, and together comprise model selection; the third 
is model prediction, where parameter estimates from 
selected models are used to predict quantities of in- 
terest and their uncertainties. In our view, inferences 
about margins and individual subjects' responses 
and a choice of an estimation method such as the 
GEE, ML, etc., both belong to the prediction phase 
of the analysis. 

In this paper we shall not consider GEE further 
because the method does not allow inferences about 
unobservables. 

3. EXTENDED LIKELIHOOD VERSUS 
BAYESIAN APPROACHES 

Besides the observed data and fixed unknown pa- 
rameters in Fisher likelihood, an additional type of 
object, namely unobservable random variables v, is 
often of interest in making statistical inferences. 

Example 4: Suppose that we have the number of 
epileptic seizures in an individual for five weeks, 
y = (3, 2, 5, 0, 4). Suppose that these counts are i.i.d. 
from a Poisson distribution with mean 9. Now we 
want to have a predictive probability function for 
the seizure counts for the next week v. Here, 9 = 
(3 + 2 + 5 + + 4)/5 = 2.8, so that the plug-in tech- 
nique gives the predictive distribution for the seizure 
count v of the next week 

f§(v = i\y) = f§(v = i) = exp(-2.8)2.87i!. 

Pearson (1920) points out the limitation of Fisher 
likelihood using the plug-in method because it can- 
not account for uncertainty in estimating 0. 

Example 5: Suppose that the data Y are collected 
from the statistical model fg(Y;9). Suppose also 
that some of the intended observations in Y are 
unobservable because they are missing. We write 
Y = (y bs,ymis) for y bs the observed and y mis for 
the missing components. Let r be missing data in- 
dicators such that 

1, if Yi is missing, 
0, if Yi is observed. 

This leads to a probability function 

fe(Y,r;9) = fe(Y)f e (r\Y). 



Here y= (y bs> r ) are the observed data, and y m i s 
are the unobservables. 

From these models, likelihood inferences can be 
made using the h-likelihood defined by 



(3.1) 



h = h(6,v) = log f e (y\v) + \ogf e (v) 
= log fo(y,v) =m + logf e {v\y) 



where m is the marginal log- likelihood m = log fe(y) 
with f e (y) = f fo(y\v)fe(v)dv. This is the (log) h- 
likelihood which plays the same role as the 
log-likelihood m in Fisher's likelihood inference for 
models without unobservables. In forming the h- 
likelihood the choice of the scale for v is important 
(Lee et al., 2006) because the mode and its curva- 
ture are used for inferences as we shall discuss. 

Throughout this paper we use /#(■) to denote prob- 
ability functions of random variables with fixed pa- 
rameters 9; the arguments within the brackets can 
be either conditional or unconditional. Thus fe(y\v) 
and fo(v\y) have different functional forms though 
we use the same fg(-) to mean probability functions 
with parameters 9. 

3.1 Bayesian Inferences 

If we assume a prior 7r(0) on parameters 9, Bayesian 
inferences can be made. The posterior is 

ir(9,v\y) oc ir(y\v, 9)tt(v\9)tt(9), 

where n(y\v,9) = fe(y\v) and tt(v\9) = fe{v). Here 
9 is also unobservable and is eliminated by integra- 
tion. Let 9-i = (0i,...,0 i _i,0 i+ i,...,0 p ) T . For 
Bayesian inferences the following various marginal 
or conditional posteriors have been used: 

w(9\y) = / n(9,v\y)dv, 



7r{9i\y)= / n(9,v\y)dvd9. 



n(vi\y)= / n(9,v\y)dv-id9, 



ir(vi\y,9) = j ir(v\y,9)dv_i. 

In this paper full Bayesian (FB) inference is assumed 
to use the marginal posteriors 7r(0j|y) and Tr(vi\y) 
while empirical Bayesian (EB) inference (Morris, 1983) 
uses the conditional posteriors n(vi\y,9) where 
are the ML estimators maximizing the likelihood 
fe(y) = 7r (0|y) — / tt(0, v\y) dv under the uniform prior 

7T(0) = 1. 
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3.2 Adjusted Profile H-likelihoods and 
Likelihood Inference 

The likelihood principle of Birnbaum (1962) states 
that Fisher's marginal likelihood fg(y) carries all 
the (relevant experimental) information in the data 
about the fixed parameters 9 so that fo(y) should 
be used for inferences about 9 (see also Berger and 
Wolpert, 1984). For estimating fixed parameters 9 
we follow the likelihood principle by using the ML 
estimator from fg(y). We view the marginal like- 
lihood as an adjusted profile likelihood eliminat- 
ing nuisance unobservables v from the h-likelihood. 
However, the computation of ML estimators can 
be a complex task because of intractable integra- 
tion. For example, in the Salamander data (Mc- 
Cullagh and Nelder, 1989) marginal-likelihood infer- 
ence, based upon numerical integration using Gauss- 
Hermite quadrature, is not feasible since a 
120-dimensional integral is required. 

Let 



log /e(y) = log / exphdv 



be the (log-) marginal likelihood. Let I = l(a,ijj) be 
a likelihood, either a marginal likelihood £ or an 
hierarchical likelihood h, with nuisance parameters 
a. Lee and Nelder (2001a) introduce a function, 
p a (l;ip), defined by 



(3.2) 



-logdet{D(Z,a)/(27r)} 



where D(l,a) = —d 2 l/da 2 and a solves dl/da = 
0. These p(-) functions define adjusted profile h- 
likelihoods (APHLs). If ir(9) = 1 the Bayesian poste- 
rior is identical to the h-likelihood, n(9, v\y) = fe(y, v) 
Thus APHLs can have a Bayesian interpretation; 
for example p V -i,e(h;Vi) is the Laplace approxima- 
tion to the marginal posterior w(vi\y), eliminating 
(v-i,9) by integration. When ir(9) = 1, it is not a 
probability if the domain is the whole real line or the 
positive real line. However, as long as the marginal 
posterior is proper (finite), n(vi\y) would be consid- 
ered as a valid posterior (Berger, 1985). 

APHLs also allow a likelihood interpretation. Here 
p v (h; 9) is the Laplace approximation to the marginal 
likelihood I obtained by integrating over unobserv- 
ables v (Lee and Nelder, 2001a); its maximum gives 
approximate (marginal) ML estimators for f3. In like- 
lihood inferences fixed parameters are eliminated by 



conditioning (if available) or profiling (in general). 
Now suppose that the parameters in a model can 
be divided into location parameters (3 and disper- 
sion parameters a 2 . Note that pp(l\a 2 ) is an ad- 
justed profile likelihood that approximates the con- 
ditional log-likelihood obtained by conditioning on 
the marginal ML estimator /3 to eliminate the fixed 
unknown parameter j3 (Cox and Reid, 1987). A well- 
known exact example of this is the use of restricted 
likelihood in linear mixed models. Furthermore, 
pg(h;v) is Davison's (1986) predictive likelihood for 
v, eliminating nuisance fixed parameters 9. The APHL 
Pv-i,9(h',Vi) eliminates V-i by integration and 9 by 
conditioning on 9. When orthogonality does not hold 
between parameters we use a profile likelihood to 
eliminate nuisance parameters. To simplify the no- 
tation we sometimes suppress arguments; for exam- 
ple we use p v (h) instead of p v {h(v, j3, a 2 ); /3, a 2 } = 
p v (h; (3, a 2 ) if this does not lead to ambiguity. 

Lee and Nelder (1996, 2001a, 2006a) propose max- 
imizing the h-likelihood h for the estimation of v, 
the marginal likelihood I for the ML estimators for f3 
and the restricted likelihood pp(£) for the dispersion 
parameters a 2 . Thus our position is consistent with 
the likelihood principle by using the marginal likeli- 
hood for inferences about 9. However, when I is nu- 
merically hard to obtain, we propose to use adjusted 
profile h-likelihoods (APHLs) p v (h) and pp )V {h) as 
approximations to I and pp{l) \ Pf3, v {h) approximates 
the restricted log-likelihood. Second-order Laplace 
approximations may sometimes be useful to improve 
accuracy. 

Many numerical studies on h-likelihood have shown 
that this development gives practically satisfactory 
estimates of parameters in many models where the 
ML estimators are hard to compute. For binary data 
Noh and Lee (2007b) show numerically that the h- 
likelihood estimator for 9 has less bias and mean 
square error than various other methods developed 
by Schall (1991), Breslow and Clayton (1993), Drum 
and McCullagh (1993), Shun and McCullagh (1995), 
Lin and Breslow (1996) and Shun (1997): see also 
the simulation studies of frailty models (Ha and Lee, 
2005) and of mixed linear models with censoring (Ha 
et al., 2002). In the salamander data, among other 
methods considered, the MCEM of Vaida and Meng 
(2004) gives the closest estimates to the h-likelihood 
estimators. 

Little and Rubin (2002) provide an extensive re- 
view of the analysis of missing data and claim that 
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h-likelihood methods are inappropriate for the esti- 
mation of 9 in missing-value settings such as that 
in Example 5. They appear to wrongly equate h- 
likelihood estimation to a joint maximization of mean 
and dispersion parameters. Yun et al. (2007) show, 
in contrast to this assertion, that when applied ap- 
propriately h-likelihood methods are both valid and 
efficient in such settings. In non- linear mixed-effect 
models the h-likelihood can also improve on existing 
methods (Noh and Lee, 2008). 

3.3 APHLs versus Marginal Posteriors 

In the Bayesian approach, simulation techniques 
such as MCMC are often used to compute the 
marginal posteriors. Consider the Epil example of 
the OpenBUGS manual, volume 1 (Thomas et al., 
2006). The data come from a clinical trial of 59 
epileptic patients. Each patient i is randomized to a 
new drug (Tj = 1) or a placebo (Tj = 0). The obser- 
vations for each patient yn, . . . , y^ are the number 
of seizures during the 2 weeks before each of four vis- 
its. The covariates are age (Ai), the baseline seizure 
counts (Bi) and an indicator variable for the fourth 
clinic visit (VA). Consider the HGLM, 

r]ij = PQ + ^Blog{B i /^) + p T T i 

+ P T xBT i xlog(B i /4) + p A A i 

+ f3 v V4: + Vi + w ijt 

using centered covariates with Vi ^ N(0,a 2 ) and 
Wij ^ iV(0,(T^). In discussing the paper by Rue et 
al. (2009) on Bayesian inferences based on priors 
<7~ 2 , a~ 2 ^ gamma(0.001, 0.001), Lee shows Figure 2 
(of this paper) for the marginal posteriors, ir(vi\y), 
nifirly) and ir(a 2 \y), from OpenBUGS (Thomas et 
al., 2006) and the corresponding APHLs, p V - 1 ,w,e(h; 

vi), Pv,w(h; f3 T ,d(^T)) and {h; o 2 v ,al{ol)) 

where 9(a) are the ML estimators of remaining (3 
and the REML estimators for the dispersion param- 
eters at (3t = a and ^(a) is the REML estimators 
of a\, at a 2 = a. Figure 2 shows almost identical 
plots for both random and fixed effects. However, 
the plots for the dispersion components are differ- 
ent because the inverse-gamma prior of Rue et al. 
(2009) is informative. This leads to biases when dis- 
persion parameters are not random but are fixed un- 
knowns, as in disease mappings (Jang et al., 2007). 
Thus without MCMC samplings similar information 
could be obtained from the extended likelihood un- 
less the assumed prior is informative. Thus, likeli- 
hood inferences can be made without the necessity 
of inventing priors for parameters. 



4. LIKELIHOOD INFERENCE FOR 
UNOBSERVABLES 

The extended likelihood principle of Bj0rnstad 
(1996) shows that extended likelihood, of which h- 
likelihood is a special case, carries all the informa- 
tion in the data about the unobserved quantities v 
and 9. Bedrick and Hill (1999) study the use of ex- 
tended likelihood as a summary function for 
unobservables. In this paper we discuss its use as 
an estimating tool. 

Consider the prediction problem in Example 4 
where the plug-in technique fg(v = i) = fg(v = i\y) = 

ir(v = i\y, 9) can be viewed as the EB. With Jeffreys' 
prior, tt(9) oc 9~ 1 / 2 , the resulting marginal poste- 
rior ir(v\y) gives a predictive probability with higher 
probabilities for larger y. Pawitan (2001) considers 
the h-likelihood, proportional to 

f e (3, 2, 5, 0, 4, v) = exp(-6#)# 3 + 2+5+0+4+w 
/(3!2!5!0!4h;!). 

Here 9(v) = (3 + 2 + 5 + + 4 + v)/6. Then the nor- 
malized profile likelihood (3, 2,5,0, 4, v) gives the 
predictive distribution of Mathiasen (1979) almost 
identical to Pearson's but without assuming a prior 
on 9 (Figure 3) (for more discussion, see Bj0rnstad, 
1990). This example shows that standard methods 
for likelihood inferences can be used for the predic- 
tion problem. In the next section we illustrate how 
to use standard likelihood methods to overcome a 
drawback of EB method. 

4.1 EB Versus H-likelihood Methods 

Because the Fisher likelihood fo(y) in (3.1) does 
not involve v, the other component (the conditional 
posterior) fe(v\y) = ir(v\y,9) seems to carry all the 
information in the data about the unobservables. 
Thus an inference would be based solely upon the 
estimated posterior, 

fe( v \v) =7r {v\yJ), 

where 9 are usually the ML estimators (Carlin and 
Louis, 2000). Using f§(v\y) to make inferences about 
v is naive, and Bj0rnstad (1990) shows how badly 
it performs in measuring the true uncertainty in esti- 
mating v. Note that maximization of the h-likelihood 
(3.1) yields EB-mode estimators for v without com- 
puting fo(v\y). However, the Hessian matrix based 
upon the estimated posterior f§(v\y) gives a naive 
variance estimate for the prediction v because it 
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Fig. 2. The marginal posteriors (■■■) versus APHLs (—). 
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Fig. 3. Predictive density of the number of seizure counts: Plug-in method (A), Bayesian method (o) and h-likelihood method 
(+)■ 



does not properly account for the uncertainty caused 
by estimating 9. Note that the marginal posterior 
variance is 

vax(vi\y) = E \ y [\ax(vi\y,9)] 

(4.1) 

+ vax e \y[E(Vi\y,6)\. 

Carlin and Gelfand (1990) note that the naive EB 
variance estimate only approximates the first term 
in the equation above. Laird and Louis (1987) and 
Carlin and Gelfand (1990) propose to use the boot- 
strap method to estimate the second term. In this 
paper the FB method uses the marginal posterior 
n(vi\y). 

Up to now most studies on h-likelihood methods 
have been about the efficiencies of parameter esti- 
mates. Here we discuss how to compute the vari- 



ance of estimated random effects. We see that in- 
ferences about random effects cannot be made by 
using only fe(v\y) as the EB method does. Because 
fe(v\y) involves the fixed parameters 9 we should 
use the whole h-likelihood to reflect the uncertainty 
in estimating 9; it is the other component fe(y) 
which carries the information about this. By using 
the h-likelihood, complete likelihood inferences can 
be made not only for 9 but also for v and their com- 
binations. 

Given 9 let v{9) be a random-effect estimator solv- 
ing dh/dv = 0. As a variance of random-effect esti- 
mators Booth and Hobert (1998) recommend using 
the conditional mean square error (CMSE) defined 
by 

CMSE(v) = E{{v(9) - v){v(9) - v)'\y} 



9 



(4.2) 



vai e (v\y) + D(9), 



where vaxg(v\y) = E{(v(9) — v)(v(9) — v)'\y} 
and D(9) = E{(v(9) - v(9))(v(9) - v(9))'\y} is the 
inflation of the CMSE caused by estimating 9. The 
EB estimator, the inverse of the Hessian matrix from 
log fg(v\y), gives an estimator for the first term 
varg(v\y) in (4.2). Thus it could give severe under- 
estimation if D(9) is large. Lee and Nelder (1996) 
note that in HGLMs (2.2), the location parameters 
(v,(3) and dispersion parameters a 2 =($, S) are or- 
thogonal so that we need consider only the variance 
inflation caused by estimating (3. The Hessian ma- 
trix of /3 and v is given by 



(4.3) J(M 



d 2 h/dpdp' 
d 2 h/dvd(3' 



d 2 h/d/3dv' 
d 2 h/dvdv' 



Here the EB variance estimator is given by —(d 2 h/ 
dvdv')~ 1 \ 0= Q. Lee and Ha (2010) show that in gen- 
eral the inverse of the Hessian matrix (4.3) gives an 
approximation to the CMSE (4.2). Before we dis- 
cuss the general use of this method we investigate 
a simple example which shows issues related to this 
problem. 

4.2 Bayarri's Example 

Bayarri et al. (1988) try to show by an example 
that likelihood inference is not possible for general 
models with unobservables. Suppose that there is a 
single fixed parameter 9, a single unobservable ran- 
dom quantity u and a single observable quantity y. 
An unobserved random variable u has a probability 
function 

fg(u) = 6exp(—0u) for u > 0, 9 > 0, 

and an observable random variable y has conditional 
probability function 

fe{y\u) = f{y\u) =uexp(-uy) for y > 0,u > 0, 

free of 9. Besides f(y\u), they considered the fol- 
lowing two additional possibilities for an extended 
likelihood for models with these three kinds of ob- 
jects: 

f'M = JeTy?' 

fe(y,u) = u9ex.p{-u(9 + y)}. 

The marginal log-likelihood m = log fg (y) gives the 
ML estimator for 9 but is totally uninformative about 
the unknown value of u. The conditional likelihood 



f(y\u) is uninformative about 9 and loses the rela- 
tionship between u and 9 reflected in f$(u). Finally, 
the extended likelihood fg(y,u) yields, if maximized 
jointly with respect to 9 and u, the useless estima- 
tors 9 = oo and u = 0. Bayarri et al. (1988) therefore 
conclude that none is useful as a likelihood for com- 
plete inferences, so that Bayes is the only method 
for inferences from general models. 
The h-(log)-likelihood is given by 

h = log fg(y,v) = log fg(y,u) + log \du/dv\ 

= 2v + log9 - u(9 + y), 

where v = log u with v being the canonical scale in 
which the joint maximization of h with respect to 9 
and u gives the ML estimator of 9 (Lee et al., 2006a). 
Suppose that the marginal likelihood is hard to ob- 
tain. The Laplace approximation is proportional to 
m = log fg(y) and gives the ML estimator 9 = y and 
its variance estimator 



var(0) = -{d 2 m/d9 2 L =§ Y 



2y v 



Given 9, the estimating equation dh/du 
the best estimator of u (Robinson, 1991), 

2 



gives 



u(e) = E(u\y) 
from which we have 
u{9) 



e + y 



o + y 



i 

y' 



Furthermore, we have 



mm) 



Note here that 
v&ig(u\y) = 



d 2 h/d9 2 
d 2 h/dud9 



d 2 h/d9du 
d 2 h/du 2 



1/9 2 
1 



E{(u(9) 



1 



(y + 9) 2 /2 



u) 2 \y} 



2/{y + 9f 



so that EB gives var e (u|y) = l/(2y 2 ). Here D(9) = 
E]{l/y - 2/(9 + y)} 2 \y} = (y - 9) 2 /{y(y + 9)} 2 = 
(9 - 9) 2 /{y(y + 9)} 2 , so that, following Booth and 
Hobert (1998), if we estimate (9 - 9) 2 by var(6>) we 

have D{9) = 2 y 2 /^ = 1 /( 2 2/ 2 )- Th us the estimator 
for the CMSE is 1/y 2 , which can be obtained from 
the corresponding element in the Hessian matrix 
1(9, u(9)). An alternative justification is that the 
h-likelihood variance estimator is estimating the un- 
conditional mean-square error because 
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E{(u(9) - u) 2 } = l/y 2 from E{(u(§) - u) 2 } = I/O 2 
(Lee et al., 2006, page 116). 

With this small example we illustrate how the h- 
likelihood gives complete likelihood inferences, giv- 
ing the ML inference for 6 and improved EB infer- 
ence by accounting for the uncertainty caused by 
estimating 9. 

4.3 H-likelihood Inferences About v 

The example shows that between extended likeli- 
hoods fg(y,u) and fg(y,v) the mode of the 
h-likelihood fg(y, v) gives a meaningful estimator for 
v, while that of fe(y,u) gives a meaningless one. 
Given that extended likelihoods should serve as the 
basis for statistical inferences of a general nature, 
we want to find a particular scale whose mode gives 
meaningful inferences about unobservables. Under 
the canonical scale the example shows that the mode 
gives the best estimator of u ~E(u\y). However, the 
canonical scale does not exist in general. In HGLMs 
Lee and Nelder (2005) show that maintaining in- 
variance of inference from extended likelihood for 
trivial re-expressions of the underlying model leads 
to a unique definition of the h-likelihood; we call this 
the weak canonical scale in which v appears in the 
linear predictor. 

In Section 3.3 we show that APHLs are often sim- 
ilar to marginal posteriors. Given (marginal) poste- 
riors, a Bayesian would use a decision-theoretic ap- 
proach to choose estimators while we use the mode 
of the h-likelihood (an extended likelihood on a par- 
ticular scale) or its APHLs. Thus the choice of the 
scale in defining the h-likelihood is important to 
guarantee the meaningfulness of the mode estima- 
tion. Lee and Ha (2010) show that the standard er- 
ror estimators from the Hessian matrix (4.3) give 
the first-order approximation to (4.1) with tt(6) = 1 
(Kass and Steffey, 1989) and to the CMSE (Booth 
and Hobert, 1998). Let w = k(u) for some monotone 
function k(-). Ha and Lee (2006) show conditions 
when the approximation becomes better. One such 
condition is that w\y follows the normal distribu- 
tion. In GLMMs when v is normal we may expect v\y 
to be approximately normal. If normal the Laplace 
approximation is exact; we expect that proposed h- 
likelihood method works well. Figure 2 shows how to 
check the normality of the conditional distribution 
by using the APHL. 



4.3.1 Analysis of the BC infant mortality data For 
disease mapping, Leroux et al. (1999) and MacNab 
et al. (2004) consider the conditional autoregressive 
(CAR) model for the relative risk V{ which satis- 
fies v ~ N(0, £) where E = a 2 D~ l , D = XQ + (1 - 
A)I, a 2 is a dispersion parameter reflecting the over- 
all heterogeneity of the underlying risks, and A is 
a dispersion parameter for the spatial autocorrela- 
tion, A € [0, 1]. The neighborhood matrix Q has the 
jth diagonal element equal to the number of neigh- 
bors of the corresponding local region while the off- 
diagonal elements in each row are equal to —1 if 
the corresponding regions are neighbors and oth- 
erwise. 

The data consist of the number of infant deaths 
and aggregated mid-year estimates of the popula- 
tion sizes of infants for 79 local health areas. Pop- 
ulation size rii varies from 123 to 52856. For these 
data Lee et al. (2007) compare inferences from the h- 
likelihood with the full Bayes (FB) analysis. For the 
FB approach, they set priors ft ~ N(0, 1/0.00001) 
and o~~ 2 ~ gamma(0. 0001, 0.0001). Initial values are 
set as a 2 = 1, ft = and Vi = 0, and they obtain a 
posterior sample of 10,000, setting thinning at 10 us- 
ing WinBUGS (MacNab et al., 2004). The coverage 
probability is calculated by 95% Wald confidence 
intervals based upon asymptotic normality for the 
relative risks (v ) using EB and h-likelihood, and in 
the FB method by equal-tail 95% credible intervals, 
the interval between the 2.5th and 97.5th percentiles 
of the posterior distribution as given by WinBUGS. 
For the FB method we use 10,000 iterations after a 
burn-in of 2000. 

Lee et al. (2007) did a simulation study, assuming 
Hi and neighborhood structures identical to those 
in the BC infant mortality; the data were generated 
based on (1.1) and (3.1) with (3 = -4.920, a 2 = 2 
and A = 0.62. Using a graph similar to Figure 4, 
they showed that the EB coverage probability de- 
creases dramatically as the population size rii in- 
creases, but that both the h-likelihood and FB meth- 
ods improve the EB method substantially by ac- 
counting for the uncertainty in estimating fixed pa- 
rameters. However, the coverage probability of FB 
also decreases as rij increases while the h-likelihood 
maintains the stated level of confidence. When rii 
becomes larger the priors for the dispersion param- 
eters in the FB may cause problems in frequen- 
tist coverage probability. The h-likelihood procedure 
maintains the frequentist coverage probabilities bet- 
ter in this problem. The h-likelihood method is su- 
perior to Ainsworth and Dean's (2006) penalized 
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quasi-likelihood (Lee et al., 2007) for spatial GLMMs 
and Ma and Jorgensen's (2007) orthodox BLUP 
method (Lee and Ha, 2010) for nonnormal Tweedie 
models. 

4.4 Inferences and Model Identif iability 

The joint model for fg(y,v) leads to a marginal 
model fe(y) for the observed data. We regard fg(y, v) 
as the fundamental model from which the marginal 
model can be made. However, different models for 
unobservables in fe(y,v) can lead to the same 
marginal model fe(y) so that care is necessary in 
making inferences about unobservables. Some model 
assumptions can be checked from the data while 
some cannot. This could be an advantage of objec- 
tive inference with the likelihood, where uncheck- 
able model assumptions cannot be identifiable. In 
Bayesian analysis, priors can give information on 
unidentifiable model assumptions so that it is hard 
to know whether the information is coming entirely 
from the uncheckable priors. 

In the modeling of incomplete data we may as- 
sume the missing data to be "missing not at ran- 
dom" (MNAR) or "assume random missingness" 
(MAR). Here assumptions for the missing mech- 
anism cannot be checked by using observed data 
[Rubin (2006)]. Molenberghs et al. (2007) further 
show that an empirical distinction between MAR 
and MNAR is not possible because each MNAR 
model fits to a set of observed data can be repro- 
duced exactly by its counterpart. Such a pair of 



models will produce identical estimates for the ob- 
served data but give different estimates for the unob- 
servables (missing data). Assumptions about unob- 
servables (missing data) are not checkable without 
additional information. Unless we have a side-study 
to determine whether the observation process de- 
pends on what would be observed, all we have is 
a model-based assessment. As a referee has pointed 
out, it will contain some unverifiable assumptions. 

In HGLMs model assumptions for unobservables 
are often verifiable, that is, checkable, by using the 
data because the unobservables are latent variables 
for observed data. Consider the one-way random- 
effect model, 

yij = /3 + Vi + eij, 

where Vi ^ iV(0, A) and e$j ^ N(0,cf>), with vi and 
ejj uncorrelated. With more than one observation 
in each group the within-group error components V{ 
and eij are separately estimable, providing variance- 
component estimates for the dispersion parameters. 
Here model parameters <j) and A connect the ob- 
served data and unobservables. Lee and Nelder 
(2006b) show that if there are different random- 
effect models giving the same induced marginal model 
for the observed data, then the h-likelihood infer- 
ences give equivalent inferences for equivalent pairs 
of objects, including unobservables. This model leads 
to a marginal model, namely the following compound- 
symmetric model: 

Yi~N(ip,\J ni +4>I ni ). 




Fig. 4. Coverage probabilities of the EB (left) , FB and h-likelihood (right) methods with respect to population size in the 
infant mortality data. 
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A compound-symmetry model with negative cor- 
relation A < is perfectly natural in a variety of 
settings (Nelder, 1954) which can be tested by the 
marginal likelihood (or APHL). Such a model can 
be covered by HGLMs if we allow a negative vari- 
ance, but then many unanswered questions arise, 
such as estimability of random effects, etc.; these 
require further research. 

Wilk and Kempthorne (1957) and Cox (1958) study 
the randomization theory of the Latin square, pay- 
ing particular attention to the effects on the inter- 
pretation of the conventional analysis of variance 
(AN OVA) of the absence of unit-treatment additiv- 
ity, a point first raised by Neyman (1935). Consider 
a model for the Latin-square design, 

Vij(k) = H + n + cj + r k 

(4.4) 

+ ( rc )ij + ( rt )ik + [ct)jk + e-ijik)- 

Suppose that the main effects are regarded as fixed. 
When the interactions (rc)jj, (rt)^, (ct)j k are fixed a 
test for the main effect is irrelevant because it makes 
no sense to postulate that either of the two main ef- 
fects is null when their interaction is not assumed 
zero (Nelder, 1994). However, if the interactions are 
regarded as random the associated main effects can 
tested without any difficulty from the ANOVA ta- 
ble. Permutation from a finite population is a way 
of generating distributions for random effects. Wilk 
and Kempthorne (1957) put constraints J2i( rc )ij = 
Ej{rc)ij = ■■■ = Ek( ct )jk = 0. Nelder (1994) points 
out that such constraints make no sense either with 
fixed or random effects. With fixed effects the choice 
of constraints to give the least-square equations a so- 
lution is essentially arbitrary. However, with random 
effects symmetric constraints on estimates of the pa- 
rameters of the form X^( rc )«? = ^Zj ( rc )ij = ••• = 

(°t)jk — arise naturally (Lee and Nelder, 1996, 
2005). However, here only fractions of combinations 
are used to make the combined error component 
Vij(k) = {rc)ij + (rt) ik + {ct)jk + eij{k) to form a sum 
of independent errors. Thus model (4.4) gives an 
identical marginal model to the conventional model 
for Latin squares with main effects only 

(4.5) y ij{k) = n + n + Cj + T k + e* j{k) . 

From Lee and Nelder (2006b) the two models lead 
to identical inferences about both fixed parameters 
and random effects, giving e.*-^ = i>ij(k)- Thus in 
(4.4) individual error components cannot be sepa- 
rated by the observed data. If a method can identify 



individual components then it must be based upon 
uncheckable model assumptions such as priors. Con- 
sider the following model: 

( 4 -6) Vij{k) = fi + ri + Cj + T ij{k) + e ij(k) , 

where Tjj(fc) = r k + (rt) ik + (ct) jk and (rt) ik and (ct) jk 
are random with zero means. This model assumes 
unit-treatment interaction and can be interpreted 
to have the average treatment effects such that 

E ( T *j(fc)) = T k- 

Then we can test that the average treatment effects 
are the same (Lee and Nelder, 2002). Thus with un- 
observables there are different methods of interpre- 
tation: we may consider {rt)i k and (ct)j k to be either 
error components or random treatment-unit interac- 
tions. These give equivalent inferences for equivalent 
quantities. 

4.5 Discussion 

There have been many alleged examples similar to 
that of Bayarri et al. (1988) and Little and Rubin 
[(2002), Chapter 6.3], purporting to show that an 
extension of the Fisher likelihood to three kinds of 
objects is not possible. Lee and Nelder (2005) refute 
those of Bayarri et al. and Yun et al. (2007) those of 
Little and Rubin. These complaints are, we believe, 
resolved by the h-likelihood framework. Zhao et al. 
(2006) claim that the Bayesian analysis is compu- 
tationally simpler for obtaining variance estimators 
for the random-effect estimates compared with its 
frequentist counterpart; however with the extended 
likelihood framework this may not be so, at least in 
the analysis of the disease-mapping areas in Section 
4.3.1. 

The h-likelihood (3.1) gives a new definition of 
conjugate families (Lee and Nelder, 2001a), show- 
ing that the likelihood for a conjugate family for 
log fg(v) takes the form of a GLM. It is the sum 
of component likelihoods, log fg{v) and log fg{y\v), 
both representable as GLM likelihoods. This means 
that an extended class of models can be decomposed 
into component GLMs (Lee and Nelder, 2001a, 2006a) 
and that these extended models can be fitted as an 
interconnected set of component GLMs. This greatly 
facilitates the development of model-checking tech- 
niques for the whole class (Lee and Nelder, 2001a). 
A single algorithm, iterative weighted least squares, 
can be used throughout all this extended class of 
models and requires neither prior distributions of 
parameters nor multi-dimensional quadrature. The 
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h-likelihood plays a key role in the synthesis of the 
computational algorithms needed for this extended 
class of models. 

This formulation means that a great variety of 
models can be fitted by a single algorithm and com- 
pared using extensions of standard GLM procedures. 
Thus we can change the link function, allow various 
types of term in the linear predictor and use model- 
selection methods for adding or deleting terms. Fur- 
thermore, various model assumptions can be checked 
by applying GLM model-checking procedures to the 
appropriate component GLMs. This establishes, we 
believe, algorithmic wiseness in the sense of Efron 
(2003). 

5. CONCLUSION 

We have shown that a broad class of new mod- 
els with wide applications can be generated by the 
probabilistic modeling of unobservables. There has 
been an attempt using the GEE method to make in- 
ferences from general non-normal multivariate mod- 
els without modeling unobservables. It pre-empts 
model selection by claiming to make inferences about 
population averages or marginal means. We do not 
disagree with the need to make marginal predictions 
after choosing a model but believe that such a need 
does not require, and indeed should not use, predic- 
tion methods at the model-selection stage. We dis- 
like the pre-emption of the model selection stage by 
a particular prediction method. Furthermore, these 
population, marginal and subject-specific averages 
are parameterizations in the probabilistic model. 
When a prediction method lacks a probabilistic model 
basis it is not possible to connect these parameters 
and compare them. 

We do not object to the use of Fisher's likelihood 
for inferences about fixed parameters. The Fisher 
likelihood framework has advantages such as gen- 
erality of application, statistical and computational 
efficiency, etc., and we agree with its use. However, 
it cannot deal with inferences from models having 
unobservables because there is always a problem of 
inference about those unobservables. H-likelihood 
gives a powerful and practical framework for sta- 
tistical inference of general model class with unob- 
servables, maintaining the advantages of the origi- 
nal likelihood framework for fixed parameters. We 
believe that more new classes of models will be de- 
veloped and that the h-likelihood will become widely 
used for inference from them. 



The h-likelihood uses the mode and its curvature 
for inferences about unobservables. Thus, in defin- 
ing the h-likelihood the scale of unobservables must 
be carefully chosen to make a valid inferences. The 
(weak) canonical scale in HGLMs leads to an imparl- 
ance of a certain extended likelihood. However, in 
general the validity of such a scale has not been es- 
tablished. The conditional normality in Section 4.3 
would be a promising condition to determine the 
scale, which can be checked by plotting the APHL. 
Further studies are required on the scale in defin- 
ing the h-likelihood under general situations beyond 
DHGLMs. For fixed parameter estimation we use 
the marginal likelihood. But it is often hard to com- 
pute, so that we have proposed using the Laplace 
approximation. However, this approximation gives 
nonnegligible biases in binary data. We have found 
that the second-order approximation is effective in 
eliminating such biases. However, it becomes very 
hard to implement as the number of random com- 
ponents increases. So it would be of interest to find 
an approximation which can be implemented under 
general situations. 
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